Summary


An error complexity analysis of two algorithms for solving a unit-diagonal triangular system is 
given. The results show that the usual sequential algorithm is optimal in terms of having the 
minimal maximum and cumulative error complexity measures. The parallel algorithm described 
by Sameh and Brent is shown to be essentially equivalent to the optimal sequential one. Some 
numerical experiments are also included. 
1. Introduction

In [1] Sameh & Brent have shown that, given $n^3/68 + O(n^2)$ processors, a triangular system of $n$
equations $Ax = b$ may be solved in $O(\log^2 n)$ steps. They have also shown that if $x$ is the computed
solution then it satisfies the equation $(A + \delta A_p)x = b$, where $\delta A_p$ is bounded by
$\|\delta A_p\| \le \alpha(n)\,\varepsilon\,\kappa^2(A)\,\|A\|$. Here $\|\cdot\|$ stands for the $\infty$-norm, $\alpha(n) = O(n^2 \log n)$,
$\varepsilon$ is the unit roundoff, and $\kappa(A)$ is the condition number of $A$. On the other hand, if $x$ is the
solution computed by the standard sequential algorithm, then it satisfies [2] the equation
$(A + \delta A_s)x = b$, where $\|\delta A_s\| \le n\varepsilon\|A\|$. Thus the bound on $\|\delta A_p\|$ can be very large
compared to that on $\|\delta A_s\|$.

In this paper we present an alternative approach to the error analysis of these two algorithms 
and show that the parallel algorithm described by Sameh and Brent is essentially equivalent to the 
usual sequential one in terms of our error complexity measures. Some numerical experiments 
confirming the theoretical prediction are also presented. 


2. Some Preliminary Results 

Given a normalized floating-point system with a $t$-digit base-$\beta$ mantissa, the additive and
multiplicative operations can be modelled by the following equations [2]:

(2.1)   $fl(x \times y) = xy\,\delta$,
        $fl(x \pm y) = (x \pm y)\,\Delta = x\Delta \pm y\Delta$,

where $\Delta$ and $\delta$ each differ from 1 by at most $u$, with

        $u = \tfrac{1}{2}\beta^{1-t}$   for rounded operation,
        $u = \beta^{1-t}$               for chopped operation,

and $x$ and $y$ are given machine floating-point numbers and $fl(\cdot)$ is used to denote the computed
floating-point result of the given argument. We shall call $\Delta$ (or $\delta$) the unit $\Delta$ (or $\delta$)-factor.
In general, one can apply (2.1) repeatedly to a sequence of division-free computational steps,
and the computed result $z$ can be expressed as:

(2.2)   $z = fl\Bigl(\sum_{j=1}^{\lambda(z)} z_j\Bigr) = \sum_{j=1}^{\lambda(z)} z_j\,\Delta^{\sigma_j}\,\delta^{\tau_j}$,

where each $z_j$ is an exact product of error-free data, and $\Delta^k$ (or $\delta^k$) stands for the product of $k$
possibly different $\Delta$ (or $\delta$)-factors. Following [3], we shall henceforth call such an exact product
of error-free data a basic term. $\lambda(z)$ is then the total number of basic terms whose sum constitutes
$z$.

Note that in (2.2), the computed $z$ is expressed as the exact sum of $\lambda(z)$ perturbed $z_j$'s. Thus
the size of $\sigma_j$ (or $\tau_j$) is an indication of the possible number of round-off occurrences during the
computational process. We define the following two measures:

maximum error complexity:

(2.3)   $\sigma(z) = \max_{1 \le j \le \lambda(z)} [\sigma_j + \tau_j]$

cumulative error complexity:

(2.4)   $s(z) = \sum_{j=1}^{\lambda(z)} (\sigma_j + \tau_j)$

Different algorithms used to compute the same quantity $z$ can then be compared using the above
error complexity measures.


From (2.3) and (2.4) we can further define the following:

(2.5)   $\sigma_a(z) = \max_{1 \le j \le \lambda(z)} \sigma_j$,   $\sigma_m(z) = \max_{1 \le j \le \lambda(z)} \tau_j$,

(2.6)   $s_a(z) = \sum_{j=1}^{\lambda(z)} \sigma_j$,   $s_m(z) = \sum_{j=1}^{\lambda(z)} \tau_j$.

Thus $\sigma_a(z)$, $s_a(z)$ or $\sigma_m(z)$, $s_m(z)$ are error complexities due to additive or multiplicative operations.
In other words, $\sigma_a(z)$, $s_a(z)$ or $\sigma_m(z)$, $s_m(z)$ are $\sigma(z)$, $s(z)$ evaluated assuming exact multiplications or
additions, respectively. Also,

        $\sigma(z) \le \sigma_a(z) + \sigma_m(z)$,
        $s(z) = s_a(z) + s_m(z)$.


Applying (2.3) and (2.4) to (2.1), it is straightforward to establish the following lemma [4]:

Lemma 2.1 If $z = fl(x \pm y)$, then

(i)     $\sigma(z) = 1 + \max(\sigma(x), \sigma(y))$,
        $s(z) = s(x) + s(y) + \lambda(z)$,
        $\lambda(z) = \lambda(x) + \lambda(y)$.

If $z = fl(x \times y)$, then

(ii)    $\sigma(z) = 1 + \sigma(x) + \sigma(y)$,
        $s(z) = s(x)\lambda(y) + s(y)\lambda(x) + \lambda(z)$,
        $\lambda(z) = \lambda(x)\lambda(y)$.
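The propagation rules of Lemma 2.1 can be sketched in a few lines of Python. This is only an illustration under the assumption that an error-free datum has $\lambda = 1$ and $\sigma = s = 0$; the names `Complexity`, `DATA`, `fl_add` and `fl_mul` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complexity:
    lam: int    # lambda(z): number of basic terms
    sigma: int  # sigma(z): maximum error complexity
    s: int      # s(z): cumulative error complexity


# Assumption: an error-free datum is a single basic term with no rounding.
DATA = Complexity(lam=1, sigma=0, s=0)


def fl_add(x: Complexity, y: Complexity) -> Complexity:
    """Lemma 2.1 (i): z = fl(x +/- y)."""
    lam = x.lam + y.lam
    return Complexity(lam, 1 + max(x.sigma, y.sigma), x.s + y.s + lam)


def fl_mul(x: Complexity, y: Complexity) -> Complexity:
    """Lemma 2.1 (ii): z = fl(x * y)."""
    lam = x.lam * y.lam
    return Complexity(lam, 1 + x.sigma + y.sigma,
                      x.s * y.lam + y.s * x.lam + lam)


# z = fl(fl(a * b) + c) for error-free a, b, c
z = fl_add(fl_mul(DATA, DATA), DATA)
print(z)
```

For this example the computed value is $ab\,\delta\Delta + c\,\Delta$, so two basic terms carry two and one factors respectively, which is what the rules return.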


Often it is more convenient to express a computed result in terms of a sum of some intermediate 
results. In such cases, we have the following lemma[5]: 


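Lemma 2.2 If the computed result $z$ can be expressed as

        $z = \sum_{j=1}^{n} z_j\,\Delta^{\sigma_j}\,\delta^{\tau_j}$,

where each $z_j$ is a product of intermediate results, then

        $\lambda(z) = \sum_{j=1}^{n} \lambda(z_j)$,   $\sigma(z) = \max_{1 \le j \le n} (\sigma_j + \tau_j + \sigma(z_j))$,

        $s_m(z) = \sum_{j=1}^{n} s_m(z_j) + \sum_{j=1}^{n} \lambda(z_j)\,\tau_j$,   $s_a(z) = \sum_{j=1}^{n} s_a(z_j) + \sum_{j=1}^{n} \lambda(z_j)\,\sigma_j$.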


In general a basic term is of the form

        $x = \prod_{i=1}^{k} x_i^{\alpha_i}$,   $\alpha_i \ge 1$,   $\lambda(x_i) = 1$,

where each $x_i$ is a single distinct error-free datum. We shall now define the multiplicative index of
$x$, or $\mu(x)$, as follows:

        $\mu(x) = \sum_{i=1}^{k} \alpha_i - 1$.

In other words, $\mu(x)$ is simply the number of sequential multiplications needed to form $x$. We need
the following lemma [5]:

Lemma 2.3 Let

        $z = fl(x) = fl\Bigl(\prod_{i=1}^{k} x_i^{\alpha_i}\Bigr)$,

then

        $\sigma_m(z) = s_m(z) = \mu(x)$.
Lemma 2.3 simply states that the multiplicative error complexities are invariant to the algorithms
used to form $z$, provided that only multiplications are used. We now establish the following lemma:

Lemma 2.4 Given basic terms $a$, $b$ and it is desired to form

        $c = fl(a \pm b)$,

then

        $\sigma_m(c) = \max(\mu(a), \mu(b))$,
        $s_m(c) = \mu(a) + \mu(b)$,

provided only associative laws are allowed to find $c$ and the computation of the type $a + ab$ is not
evaluated by $a(1 + b)$.

Proof If there is no common factor between $a$ and $b$, then $a$ and $b$ have to be evaluated
separately before the final addition. By Lemma 2.2 we have

        $fl(a) = a\,\delta^{\mu(a)}$,   $fl(b) = b\,\delta^{\mu(b)}$.

Hence

        $c = fl(fl(a) \pm fl(b)) = a\,\delta^{\mu(a)}\Delta \pm b\,\delta^{\mu(b)}\Delta$.

By definition

        $\sigma_m(c) = \max(\mu(a), \mu(b))$,   $s_m(c) = \mu(a) + \mu(b)$.

Hence the lemma is true.

If there is a common factor, say $x$, between $a$ and $b$, then

        $a = x a'$,   $b = x b'$,   $a', b' \ne 1$,

and one might choose to compute $c$ as

        $c = fl(x(a' \pm b'))$

once $fl(x)$, $fl(a')$, $fl(b')$ are computed. Now by (2.1)

        $c = fl(fl(x)\,fl(fl(a') \pm fl(b')))$
          $= x\,\delta^{\mu(x)} (a'\delta^{\mu(a')} \pm b'\delta^{\mu(b')})\,\Delta\,\delta$
          $= x a'\,\Delta\,\delta^{\mu(x)+\mu(a')+1} \pm x b'\,\Delta\,\delta^{\mu(x)+\mu(b')+1}$.

Hence by definition, and since $\mu(a) = \mu(x) + \mu(a') + 1$ and $\mu(b) = \mu(x) + \mu(b') + 1$,

        $\sigma_m(c) = \max(\mu(x) + \mu(a') + 1,\ \mu(x) + \mu(b') + 1)$
                    $= \max(\mu(a), \mu(b))$

and

        $s_m(c) = \mu(x) + \mu(a') + 1 + \mu(x) + \mu(b') + 1$
               $= \mu(a) + \mu(b)$.   Q.E.D.

By repeated application of Lemma 2.4 to the evaluation of (2.2) we can easily establish the
following theorem:

Theorem 2.1 The computed $z$ of (2.2) is such that

        $\sigma_m(z) = \max_{1 \le j \le \lambda(z)} \mu(z_j)$,

        $s_m(z) = \sum_{j=1}^{\lambda(z)} \mu(z_j)$.

In other words, Theorem 2.1 states that the multiplicative error complexities are invariant to the
algorithms used to evaluate $z$. Henceforth we shall only look at the additive error complexities in
the evaluation of different algorithms for the computation of the type of (2.2). This is equivalent
to having exact multiplication operations possible for the computation of (2.2). We need the
following theorem:
Theorem 2.2 If in (2.2) $\lambda(z) = 2^k$ and $2^k - 1$ additions are used to evaluate $z$, then

        $\sigma_a(z) \ge k$,   $s_a(z) \ge k\,2^k$.

Proof The computation of $z$ in (2.2) is equivalent to the construction of a binary tree with $2^k$
leaves at the top and $2^k - 1$ interior nodes of additions, with $z$ the output of the bottom root node.
In such a case $\sigma_a(z)$ is the height of the tree and $s_a(z)$ is the sum of the lengths of all the paths
from the leaf nodes to the root node. Both are minimized by the complete binary tree, in which
every leaf is at depth $k$. Q.E.D.
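The tree argument can be checked numerically. The sketch below, with hypothetical helper names `balanced_depths` and `sequential_depths`, lists the depth of every leaf for a recursively halved addition tree and for a purely sequential (left-to-right) one; the height corresponds to $\sigma_a(z)$ and the sum of the depths to $s_a(z)$.

```python
def balanced_depths(n: int) -> list[int]:
    """Leaf depths when n terms are summed by recursive halving."""
    if n == 1:
        return [0]
    half = (n + 1) // 2
    return ([d + 1 for d in balanced_depths(half)]
            + [d + 1 for d in balanced_depths(n - half)])


def sequential_depths(n: int) -> list[int]:
    """Leaf depths when n terms are summed left to right."""
    if n == 1:
        return [0]
    # The first two terms take part in all n - 1 additions;
    # term j (1-based, j >= 2) takes part in n - j + 1 additions.
    return [n - 1] + [n - j + 1 for j in range(2, n + 1)]


k = 3
n = 2 ** k
bal = balanced_depths(n)
seq = sequential_depths(n)
print("balanced  :", max(bal), sum(bal))   # attains the bounds k and k * 2**k
print("sequential:", max(seq), sum(seq))   # strictly larger on both measures
```

With $k = 3$ the balanced tree attains both lower bounds of Theorem 2.2, while the sequential chain exceeds them.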

An important type of computation of (2.2) is the evaluation of the inner product given as

(2.7)   $z = fl\Bigl(\sum_{i=1}^{k} x_i y_i\Bigr)$;

we need to specify the order in which the additions are executed. We discuss several strategies.

If the products are added recursively in parallel by divide-and-conquer, then the strategy is called
left-heavy if

        $z = fl(z_1 + z_2)$,

where

        $z_1 = fl\Bigl(\sum_{i=1}^{\lceil k/2 \rceil} x_i y_i\Bigr)$,   $z_2 = fl\Bigl(\sum_{i=\lceil k/2 \rceil + 1}^{k} x_i y_i\Bigr)$.

Similarly the strategy is called right-heavy if

        $z = fl(z_3 + z_4)$,   $z_3 = fl\Bigl(\sum_{i=1}^{\lfloor k/2 \rfloor} x_i y_i\Bigr)$,   $z_4 = fl\Bigl(\sum_{i=\lfloor k/2 \rfloor + 1}^{k} x_i y_i\Bigr)$.

If the inner product is summed up in sequential order, then we have the common strategies of
left-to-right or right-to-left.

We now establish the following theorem:

Theorem 2.3 Assuming exact multiplications are possible in evaluating the $x_i$, $y_i$ and $x_i y_i$ of
(2.7) and

(2.8)   $\sigma_a(x_1) > \sigma_a(x_2) > \cdots > \sigma_a(x_{k-1}) = \sigma_a(x_k) \ge 0$,
        $\sigma_a(y_1) = \sigma_a(y_2) = \cdots = \sigma_a(y_{k-1}) = \sigma_a(y_k) \ge 0$,

then the computed $z$ of (2.7) is such that

        $\sigma_a(z) = \sigma_a(x_1) + \sigma_a(y_1) + w$,

where

        $w = \lceil \log k \rceil$    if the strategy is left-heavy,
        $w = \lfloor \log k \rfloor$  if the strategy is right-heavy,
        $w = k - 1$                   if the strategy is left-to-right,
        $w = 1$                       if the strategy is right-to-left.


Proof We first consider the last two cases. If the strategy is left-to-right, then we can easily
obtain

        $z = x_1 y_1 \Delta^{k-1} + x_2 y_2 \Delta^{k-1} + \cdots + x_k y_k \Delta^{1}$.

Hence we have

        $\sigma_a(z) = \max(\sigma_a(x_1) + \sigma_a(y_1) + k - 1,\ \sigma_a(x_2) + \sigma_a(y_2) + k - 1,\ \ldots,\ \sigma_a(x_k) + \sigma_a(y_k) + 1)$
                    $= \sigma_a(x_1) + \sigma_a(y_1) + k - 1$

and the theorem is true.

If the strategy is right-to-left, then we have

        $z = x_1 y_1 \Delta^{1} + \cdots + x_{k-1} y_{k-1} \Delta^{k-1} + x_k y_k \Delta^{k-1}$.

By (2.8) we have

        $\sigma_a(x_1) \ge \sigma_a(x_2) + 1 \ge \cdots \ge \sigma_a(x_{k-1}) + k - 2 = \sigma_a(x_k) + k - 2$.

Hence

        $\sigma_a(x_1) + \sigma_a(y_1) + 1 \ge \sigma_a(x_2) + \sigma_a(y_2) + 2 \ge \cdots \ge \sigma_a(x_{k-1}) + \sigma_a(y_{k-1}) + k - 1 = \sigma_a(x_k) + \sigma_a(y_k) + k - 1$.

And indeed

        $\sigma_a(z) = \sigma_a(x_1) + \sigma_a(y_1) + 1$.

For the parallel strategies, we prove by induction. For $k = 1$ the theorem is trivial. Assume it
is true for $k - 1$ expressed as

        $k - 1 = (\beta_j \beta_{j-1} \cdots \beta_0)_2$,   $\beta_j = 1$,

in binary form. For $k$ then, if the strategy is left-heavy, we have

        $z = fl\Bigl(\sum_{i=1}^{\lceil k/2 \rceil} x_i y_i + \sum_{i=\lceil k/2 \rceil + 1}^{k} x_i y_i\Bigr)$.

By assumption

        $\sigma_a(z) = 1 + \max(\sigma_L, \sigma_R)$
                    $= \sigma_a(x_1) + \sigma_a(y_1) + 1 + \lceil \log \lceil k/2 \rceil \rceil$,

where

        $\sigma_L = \sigma_a(x_1) + \sigma_a(y_1) + \lceil \log \lceil k/2 \rceil \rceil$,
        $\sigma_R = \sigma_a(x_{\lceil k/2 \rceil + 1}) + \sigma_a(y_{\lceil k/2 \rceil + 1}) + \lceil \log (k - \lceil k/2 \rceil) \rceil$.

Since

        $k - 1 = \sum_{i=0}^{j} \beta_i 2^i$,

hence

        $k = \sum_{i=0}^{j} \beta_i 2^i + 1$.

And we have

        $2^j < k \le 2^{j+1}$,
        $2^{j-1} < \lceil k/2 \rceil \le 2^j$,
        $j - 1 < \log \lceil k/2 \rceil \le j$.

Therefore

        $1 + \lceil \log \lceil k/2 \rceil \rceil = j + 1 = \lceil \log k \rceil$.

So the theorem is true. Similar reasoning can be used to show the truth of the theorem for the
right-heavy strategy. Q.E.D.
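Theorem 2.3 can be checked by simulating only the additive error complexities of the four summation orders. In the sketch below each product $x_i y_i$ is represented by the integer $c_i = \sigma_a(x_i) + \sigma_a(y_i)$ (multiplications are assumed exact), and every addition applies rule (i) of Lemma 2.1. The function names and the test profile `profile(k)` are illustrative assumptions, chosen so that the ordering hypothesis of the theorem holds with $\sigma_a(y_i) = 0$.

```python
def left_heavy(c: list[int]) -> int:
    """sigma_a of the sum when the left half gets ceil(k/2) terms."""
    if len(c) == 1:
        return c[0]
    h = (len(c) + 1) // 2
    return 1 + max(left_heavy(c[:h]), left_heavy(c[h:]))


def right_heavy(c: list[int]) -> int:
    """sigma_a of the sum when the left half gets floor(k/2) terms."""
    if len(c) == 1:
        return c[0]
    h = len(c) // 2
    return 1 + max(right_heavy(c[:h]), right_heavy(c[h:]))


def left_to_right(c: list[int]) -> int:
    acc = c[0]
    for term in c[1:]:
        acc = 1 + max(acc, term)
    return acc


def right_to_left(c: list[int]) -> int:
    return left_to_right(c[::-1])


def profile(k: int) -> list[int]:
    """Strictly decreasing complexities, with the last two equal."""
    return [max(k - i, 1) for i in range(1, k + 1)]


k = 6
c = profile(k)                      # [5, 4, 3, 2, 1, 1]
ceil_log = (k - 1).bit_length()     # ceil(log2 k)
floor_log = k.bit_length() - 1      # floor(log2 k)
print(left_heavy(c) - c[0], right_heavy(c) - c[0],
      left_to_right(c) - c[0], right_to_left(c) - c[0])
```

The printed increments over $\sigma_a(x_1) + \sigma_a(y_1)$ are the values of $w$ for the four strategies at $k = 6$.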


3. Error Complexity Analysis 


Given a unit-diagonal lower triangular system

(3.1)   $Ax = b$

where

        $A = \begin{bmatrix} 1 \\ a_{21} & 1 \\ \vdots & & \ddots \\ a_{n1} & \cdots & a_{n,n-1} & 1 \end{bmatrix}$,   $b = [b_1, b_2, \ldots, b_n]^T$,   $x = [x_1, x_2, \ldots, x_n]^T$,

then the exact solution $x_i$ can be expressed as

(3.2)   $x_i = (-1)^{i-1} \det \begin{bmatrix} b_1 & 1 \\ b_2 & a_{21} & 1 \\ \vdots & \vdots & & \ddots \\ b_{i-1} & a_{i-1,1} & \cdots & a_{i-1,i-2} & 1 \\ b_i & a_{i1} & \cdots & a_{i,i-2} & a_{i,i-1} \end{bmatrix}$.

Thus the evaluation of $x_i$ is equivalent to the evaluation of the determinant of an $i$ by $i$ lower
Hessenberg matrix with unity super-diagonal elements. We assume the given $A$ and $b$ are error-free
original data with $\lambda(a_{ij}) = \lambda(b_i) = 1$ and $\mu(a_{ij}) = \mu(b_i) = 0$. Denoting by $t_i$ the generic computation
of the determinant of such a matrix, it is easily shown (by expanding the first row of the above
determinant) that

(3.3)   $t_i = fl(t_{i-1} + w\,t_{i-1})$,   $\lambda(w) = 1$,   $\mu(w) = 0$,

where $w$ is error-free. It is obvious that

        $\lambda(t_i) = 2\lambda(t_{i-1})$,   $\lambda(t_1) = 1$,

hence

(3.4)   $\lambda(t_i) = 2^{i-1}$.

Furthermore we have the following lemma:

Lemma 3.1 The computation of $t_i$ requires at most $2^{i-1} - 1$ additive operations.

Proof Denoting by $a_i$ the number of additive operations needed to compute $t_i$, it is obvious from
(3.3) that

        $a_i = 2a_{i-1} + 1$,   $a_1 = 0$.

The solution is given as

        $a_i = 2^{i-1} - 1$.

This proves the lemma. Q.E.D.
Lemma 3.1 together with (3.4) and Theorem 2.2 gives us the following theorem:

Theorem 3.1 The computation of $x_i$ requires at most a total of $2^{i-1} - 1$ additive operations
with

        $\sigma_a(x_i) \ge i - 1$,
        $s_a(x_i) \ge (i - 1)\,2^{i-1}$.


The solution of (3.1) can also be expressed as the following:

(3.5)   $x = M_{n-1} M_{n-2} \cdots M_2 M_1 b$

where

        $M_i = I - a_i e_i^T$,
        $a_i = [0(1{:}i),\ a_{i+1,i},\ \ldots,\ a_{n,i}]^T$,

and $e_i$ is the $i$-th column of the identity matrix $I$. Note in the above expression we use $a(b{:}c)$ to
denote a sub-vector of identical component $a$ placed in the $b$-th to $c$-th positions of a larger vector.


The usual sequential algorithm can then be expressed as

        x^{(0)} = b
        for i = 1 to n - 1 do
            x^{(i)} = M_i x^{(i-1)}

or more specifically,

(3.6)   for j = 1 to n do
            x_j^{(0)} = b_j
        for i = 1 to n do
            x_i = x_i^{(i-1)}
            for j = 1 to i do
                x_j^{(i)} = x_j^{(i-1)}
            for j = i + 1 to n do
                x_j^{(i)} = fl(x_j^{(i-1)} - a_{ji} x_i^{(i-1)})
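A minimal Python sketch of the column-oriented sequential algorithm (3.6) follows. It overwrites a single vector instead of keeping every $x^{(i)}$, which is an implementation choice made here for brevity; the function name `solve_sequential` and the small test system are illustrative assumptions.

```python
def solve_sequential(A: list[list[float]], b: list[float]) -> list[float]:
    """Solve A x = b for unit-diagonal lower triangular A, as in (3.6)."""
    n = len(b)
    x = list(b)                       # x^{(0)} = b
    for i in range(n):                # column i is eliminated at step i
        # x[i] is final here; update the remaining components.
        for j in range(i + 1, n):
            x[j] = x[j] - A[j][i] * x[i]
    return x


A = [[1, 0, 0, 0],
     [2, 1, 0, 0],
     [-1, 3, 1, 0],
     [4, 0, -2, 1]]
b = [1, 2, 3, 4]
x = solve_sequential(A, b)
print(x)
```

The strictly upper part and the diagonal of `A` are never read, so only the subdiagonal entries $a_{ji}$ matter.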


We have the following lemma:

Lemma 3.2 The computation of $x^{(i)}$ is equivalent to the computation of

        $t^{(i)} = [t_1\ t_2\ \cdots\ t_i\ t_{i+1}(i+1{:}n)]^T$.

Proof We only need to notice that the inner loop computation of (3.6) is essentially of the type
$fl(t_i + w \times t_i)$, which is $t_{i+1}$ by definition. Q.E.D.

Applying Lemma 2.1 (i) to the inner loop of (3.6) and assuming exact multiplications, we have
for $j > i$

        $\sigma_a(x_j^{(i)}) = 1 + \max(\sigma_a(x_j^{(i-1)}), \sigma_a(x_i))$,
        $s_a(x_j^{(i)}) = s_a(x_j^{(i-1)}) + s_a(x_i) + \lambda(x_j^{(i)})$,
        $\lambda(x_j^{(i)}) = \lambda(x_j^{(i-1)}) + \lambda(x_i)$,

with $\lambda(x_1) = 1$ and $s_a(x_1) = \sigma_a(x_1) = 0$.

The solutions to the above equations are given in the following theorem:

Theorem 3.2 The sequential algorithm of (3.6) produces results such that for $j > i$

        $\sigma_a(x_j^{(i)}) = i$,   $s_a(x_j^{(i)}) = i\,2^i$,   $\lambda(x_j^{(i)}) = 2^i$,

        $\sigma_a(x_i) = i - 1$,   $s_a(x_i) = (i - 1)\,2^{i-1}$,   $\lambda(x_i) = 2^{i-1}$.
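The closed forms for the sequential algorithm can be confirmed by iterating the recurrences obtained from Lemma 2.1 (i) directly. The sketch below assumes exact multiplications and error-free data with $\lambda = 1$, $\sigma_a = s_a = 0$; the function name `sequential_complexities` is hypothetical.

```python
def sequential_complexities(n: int) -> dict[int, tuple[int, int, int]]:
    """Return {i: (sigma_a(x_i), s_a(x_i), lambda(x_i))} for algorithm (3.6)."""
    # 1-based arrays; index 0 is unused. Initial state is x^{(0)} = b.
    lam = [1] * (n + 1)
    sig = [0] * (n + 1)
    s = [0] * (n + 1)
    final = {}
    for i in range(1, n + 1):
        final[i] = (sig[i], s[i], lam[i])        # x_i = x_i^{(i-1)}
        for j in range(i + 1, n + 1):            # x_j^{(i)} = fl(x_j - a_ji * x_i)
            new_lam = lam[j] + lam[i]
            sig[j] = 1 + max(sig[j], sig[i])
            s[j] = s[j] + s[i] + new_lam
            lam[j] = new_lam
    return final


result = sequential_complexities(10)
for i, (sigma_a, s_a, lam_i) in result.items():
    print(i, sigma_a, s_a, lam_i)
```

Each printed row should agree with $\sigma_a(x_i) = i - 1$, $s_a(x_i) = (i-1)2^{i-1}$ and $\lambda(x_i) = 2^{i-1}$.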
Comparing the results in the above theorem to those in Theorem 3.1, we conclude that the
sequential algorithm is optimal in terms of having the minimal maximum and cumulative error
complexities.

We now turn our attention to the parallel algorithm as proposed by Sameh and Brent [1]. The
algorithm for $n = 2^\nu$ is given as follows:

(3.7)   for i = 1 to 2^ν - 1 do
            M_i^{(0)} = M_i
        b^{(0)} = b
        for j = 0 to ν - 1 do
            for k = 2^{ν-j-1} - 1 downto 1 do
                M_k^{(j+1)} = M_{2k+1}^{(j)} M_{2k}^{(j)}
            b^{(j+1)} = M_1^{(j)} b^{(j)}
        x = b^{(ν)}
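As an illustration, the product-form algorithm can be written serially in Python with plain nested lists. This is only a sketch: the helper names (`elementary`, `matmul`, `matvec`, `solve_product_form`) are hypothetical, the pairwise products $M_{2k+1}^{(j)} M_{2k}^{(j)}$ that would run concurrently on a parallel machine are formed one after another, and no attempt is made to exploit the sparsity of the factors.

```python
def elementary(A, i):
    """M_i = I - a_i e_i^T for 1-based i, where a_i holds column i of A below the diagonal."""
    n = len(A)
    M = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    for r in range(i, n):                 # 0-based rows i..n-1 are 1-based rows i+1..n
        M[r][i - 1] = -A[r][i - 1]
    return M


def matmul(X, Y):
    n = len(X)
    return [[sum(X[r][t] * Y[t][c] for t in range(n)) for c in range(n)]
            for r in range(n)]


def matvec(X, v):
    return [sum(X[r][t] * v[t] for t in range(len(v))) for r in range(len(v))]


def solve_product_form(A, b):
    """Solve A x = b (unit lower triangular, n a power of two) by repeated pairing."""
    n = len(b)
    nu = n.bit_length() - 1
    assert 2 ** nu == n, "this sketch assumes n = 2**nu"
    M = {i: elementary(A, i) for i in range(1, n)}      # level 0: M_1 .. M_{n-1}
    x = list(b)
    for j in range(nu):
        x_next = matvec(M[1], x)                        # b^{(j+1)} = M_1^{(j)} b^{(j)}
        M = {k: matmul(M[2 * k + 1], M[2 * k])          # M_k^{(j+1)}
             for k in range(1, 2 ** (nu - j - 1))}
        x = x_next
    return x


A = [[1, 0, 0, 0],
     [2, 1, 0, 0],
     [-1, 3, 1, 0],
     [4, 0, -2, 1]]
b = [1, 2, 3, 4]
x_par = solve_product_form(A, b)
print(x_par)
```

On this small integer system the result coincides with forward substitution, since exact arithmetic makes the order of the matrix products irrelevant; in floating point the two orders differ only in rounding.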

First the $\lambda$ matrix of $M_k^{(0)}$ can easily be obtained as:

        $\lambda(M_k^{(0)}) = I + \lambda(a_k)\,e_k^T$


where 




Then we have the following theorem: 
Theorem 3.3 


;.(.v/^' +,) ) = x(.\ 4l+x)Wt?k) = 


A(L (/+,) ) 

m4 +1) ) 



X(b U+i) ) = XiM^Xib 01 ) = [1 2 1 2 2 ... 2 i ' +1 _, (2 /+1 :/i)] 7 ' 


where 


p = ki +X - 1, q — 2 v — p — 2 /+1 


/(L (/+l) ) = 




. 2 ° 1 


;.(4' +1> ) = 


;.(Pf +1) )_ 



q rows 


and x(L^ +I) ) and a(V^ 1) ) are the first 2 /+l and the last q — 2 J+l rows of A(/?£ +l> ) , respectively. 

Proof See Appendix I. 

If we define $\sigma(A)$ as a matrix whose $(i,j)$-th component is $\sigma(a_{ij})$, then we have the following
theorem:


Theorem 3.4 a(A/£ vl) ) is of the same structure as that of Furthermore let 

a gh = the (£,A)-th element of , then we have 


f >0if£ — /i>2| 
| = 0 otherwise J 


p+ 1 < h<p+ 2 /+1 , 


/> + 1 <g£2 v -p 


°g,P + 1 > °g<P+2 > — > <r ^ J>+ 2 /+1 


CT />+l,A < < ••• < a p+2 / ^',h ~ ~ a 2 , -p,h 

Op+lj, = < *;>+2./> = a a( x 2) < - < a p+2/^p = - = a 2 v —pjp = 


Proof See Appendix II. 

With the general property established in Theorem 3.4, we have the following theorem for 

a(x): 

Theorem 3.5 If in (3.7) the inner products are evaluated using either the left-heavy or the
right-heavy strategy, then the computed $x$ is such that

        $\sigma_a(x_i) \le 1.5(i - 1)$.


Proof At the (j+ l)-th step let us denote by m gh the (g,h)th component of Mf. Now by con- 
struction 


r -A® 
X 7! +i ~ +i’ 


\<i<i 


To calculate 


b U+l) = 


we have 


/ 2 / ’+<— 1 


/,(/.+ !) 


/=/ 7 ^ "v+u^ + V+/ 




i < / < ^ 


For / — 2 ; ’ we have 


jcy+i 



where the summation (inner product) is evaluated by either the left-heavy or the right-heavy parallel 
strategy. Now by assumption 


“b 1 — ^ — 2^ 


Also 


ff a ("V +1 li !c)' > a a( m 2/^ 1 jt +1 ). 2^+1 < ^ < 2 /+1 - 2 


<7 a(^2 y + 1 > 2 / * + 1 ) 
Hence by Theorem 2.3 we have
o a (b%t V) = ojxjj+i) = ejjnp ijf +1 ) + trJjCj/) +j = 2 a a (xj) +j, a a (x { ) = 0. 


The solution to the above equation is given as 

a a (.r^i) = 3(2 / ) -y - 2, 0 <y + 1 < v 

For general i we also have by Theorem 2.3 that 

o a (x^j + .) = a a (Xi) + o a (x^j) + | log(i +1)| 

where | . I is used to denote either HI or L J ■ 

Now if 

/ = ArAr-1 •••/?() 

A,= l, A/e {0,1}, 0 </< r — 1 


then <r a (x t ) can further be expressed as 


<rJ x i) = Pr a a( x 2 ') + a a( x i-pX) + /U log(/ + 1 - /?/) | 

= Pr ff a( x 2') "F P r—\ a j( X 2 r ~ l ) F ff a( x i— p X _ p r _X~ 1 ^ 

+ P r | log(/ + 1 - P r 2 r ) I + p r _ x I log(/ + 1 - /f r 2' - P r _ x 2 r ~ l ) | 

r 

= ... < YjPjOaix-/) + a a( x p 0 2°) 

/= 1 

+ /? r | log 2 r | + p r _ x I log 2 r_I | +...+/?, | log 2 1 1 

r r 

< - 1) = J// 2 '' + - l)<y(/- 1) Q.E.D. 

/=1 /=! 


If the inner products are evaluated in a sequential manner, then similar reasoning can be used
to establish the following theorem:

Theorem 3.6 If in (3.7) the inner products are evaluated in sequence, then the computed $x$ is
such that

(i) if the strategy is left-to-right, then

        $\sigma_a(x_i) \le \lceil \log i \rceil\,2^{\lceil \log i \rceil - 1}$

(ii) if the strategy is right-to-left, then

        $\sigma_a(x_i) = i - 1$,

(g- h - 1 if g - h > 2 | p + 1 < h < p + 2/\ 
[0 otherwise J p+l<g<2 v -p 


where o gk is used to denote the (g,h)-th element of and p — £2' vl — 1. 


We can now summarize the results as follows:

(3.8)   $\sigma_a(x_i)$
            $= i - 1$                                               if the strategy is right-to-left or the algorithm of (3.6) is used,
            $\le 1.5(i - 1)$                                        if the strategy is left-heavy or right-heavy,
            $\le \lceil \log i \rceil\,2^{\lceil \log i \rceil - 1}$  if the strategy is left-to-right.

The cumulative error complexities can then be bounded using (3.8) as follows:

(3.9)   $s_a(x_i)$
            $= (i - 1)\,2^{i-1}$                                              if the strategy is right-to-left or the algorithm of (3.6) is used,
            $\le 1.5(i - 1)\,2^{i-1}$                                         if the strategy is left-heavy or right-heavy,
            $\le 2^{i-1} \lceil \log i \rceil\,2^{\lceil \log i \rceil - 1}$  if the strategy is left-to-right.

We conclude that the parallel algorithm (3.7) is as accurate as the sequential algorithm (3.6) if
the parallel inner products are evaluated using the right-to-left strategy. For other strategies we
can easily obtain from (3.8) and (3.9) that

        $\dfrac{\sigma_a(x_i) \text{ resulting from (3.7)}}{\sigma_a(x_i) \text{ resulting from (3.6)}} \le$   $1.5$ if the strategy is left-heavy or right-heavy,
                                                                                                                  $\lceil \log i \rceil$ if the strategy is left-to-right;

        $\dfrac{s_a(x_i) \text{ resulting from (3.7)}}{s_a(x_i) \text{ resulting from (3.6)}} \le$   $1.5$ if the strategy is left-heavy or right-heavy,
                                                                                                        $\lceil \log i \rceil$ if the strategy is left-to-right.

Hence in all cases the parallel algorithm is essentially 'equivalent' to the usual sequential algorithm
in terms of our error complexity measures.


4. Numerical Experiments and Conclusion 


In the first experiment a 64 by 64 lower triangular system satisfying 


Xi+x = 4^ - + 1, x { = 1, * 2 = 5 


is solved in Pascal shortreal using an IBM 370 machine. The unit round-off is $16^{-5}$. If we denote
by $e_{seq}(x_i)$ and $e_{par}(x_i)$, respectively, the absolute error of $x_i$ produced by the sequential and parallel
algorithm, then a selected sample of errors is shown below:

         k    e_seq(x_4k)    e_par(x_4k)
         1    0              0
         2    0              0
         3    0              0
         4    3.68E02        8.80E02
         5    2.26E05        2.91E05
         6    8.52E07        5.16E07
         7    2.15E10        6.99E09
         8    4.50E12        1.48E12
         9    1.02E15        3.02E14
        10    2.40E17        5.97E16
        11    5.10E19        1.07E19
        12    1.08E22        2.58E21
        13    2.29E24        4.78E23
        14    4.75E26        1.07E26
        15    1.04E29        2.47E28
        16    2.38E31        4.80E30


For the second experiment a set of 100 random lower triangular matrices $A$ with unit diagonal
elements and 100 vectors $b$ are generated such that

        $0 > a_{ij} > -1$,   $0 < b_i < 1$,   $1 \le i, j \le 64$,   $i - j \ge 1$.

The systems are solved using an IBM PC with an 8087 coprocessor. The unit round-off is $2^{-23}$.
The cumulative absolute errors of all $x_i$ produced by the sequential and parallel algorithms are
represented by $ce_{seq}(x_i)$ and $ce_{par}(x_i)$, respectively. A selected set of errors is given below:

         k    ce_seq(x_4k)   ce_par(x_4k)
         1    5.16E-6        3.88E-6
         2    3.52E-5        3.16E-5
         3    1.83E-4        1.91E-4
         4    9.85E-4        1.02E-3
         5    5.76E-3        5.37E-3
         6    3.29E-2        2.15E-2
         7    1.75E-1        1.74E-1
         8    8.45E-1        7.42E-1
         9    4.29E00        3.47E00
        10    1.79E01        1.85E01
        11    9.43E01        9.10E01
        12    5.69E02        5.12E02
        13    3.12E03        2.36E03
        14    1.57E04        1.67E04
        15    8.90E04        6.61E04
        16    4.61E05        4.67E05

We see from the above tables that the numerical results produced by the parallel algorithm are
as accurate as those produced by the usual sequential algorithm. In the first experiment the parallel
results can even be classified as slightly 'better' than the sequential ones.

Appendix I. Proof of Theorem 3.3

To show the validity of

(A.1)   $\lambda(M_k^{(j+1)}) = \lambda(M_{2k+1}^{(j)})\,\lambda(M_{2k}^{(j)})$

we first show that for general matrix multiplication

        $C = fl(A \times B)$,   $A \in R^{m \times n}$,   $B \in R^{n \times r}$,   $C \in R^{m \times r}$,

we have

        $\lambda(C) = \lambda(A)\,\lambda(B)$.

By definition

        $c_{ij} = fl\Bigl(\sum_{k=1}^{n} a_{ik} b_{kj}\Bigr)$,

then we have, by repeated application of Lemma 2.1,

        $\lambda(c_{ij}) = \sum_{k=1}^{n} \lambda(a_{ik})\,\lambda(b_{kj})$.

So we have

        $\lambda(C) = \lambda(A)\,\lambda(B)$.

The validity of (A.1) can then be shown by direct substitution of the results for
$\lambda(M_{2k+1}^{(j)})$ and $\lambda(M_{2k}^{(j)})$ into (A.1). Q.E.D.
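The identity $\lambda(C) = \lambda(A)\lambda(B)$ means the number of basic terms can be tracked with ordinary integer matrix products. The sketch below runs the pairing schedule of the parallel algorithm on $\lambda$ matrices only. It assumes that $\lambda(a_k)$ has ones in positions $k+1$ to $n$ and zeros elsewhere, and that $\lambda(b)$ is a vector of ones; both follow from treating every entry of $A$ and $b$ as a single error-free datum, but they are stated here as assumptions. The function names are hypothetical.

```python
def lam_elementary(n, k):
    """lambda(M_k^{(0)}) = I + lambda(a_k) e_k^T for 1-based k (assumed form of lambda(a_k))."""
    L = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    for r in range(k, n):          # 0-based rows k..n-1 are 1-based rows k+1..n
        L[r][k - 1] = 1
    return L


def matmul(X, Y):
    n = len(X)
    return [[sum(X[r][t] * Y[t][c] for t in range(n)) for c in range(n)]
            for r in range(n)]


def matvec(X, v):
    return [sum(X[r][t] * v[t] for t in range(len(v))) for r in range(len(v))]


def lam_parallel(nu):
    """lambda(x) for the pairing schedule with n = 2**nu, using integer arithmetic."""
    n = 2 ** nu
    L = {k: lam_elementary(n, k) for k in range(1, n)}
    lam_b = [1] * n
    for j in range(nu):
        lam_next = matvec(L[1], lam_b)
        L = {k: matmul(L[2 * k + 1], L[2 * k]) for k in range(1, 2 ** (nu - j - 1))}
        lam_b = lam_next
    return lam_b


print(lam_parallel(2))
print(lam_parallel(3))
```

Under these assumptions the parallel schedule yields $\lambda(x_i) = 2^{i-1}$, the same count of basic terms as the sequential algorithm, so the two algorithms differ only in how many rounding factors each term accumulates.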

Appendix II. Proof of Theorem 3.4

First of all, the nontrivial part of $\sigma_a(M_k^{(j+1)})$ is of the same structure as that of $\lambda(M_k^{(j+1)})$.
Furthermore, the diagonals and subdiagonals of $M_k^{(j+1)}$ consist of only one basic term each. Hence
no additive operations are involved. And we have

Let us assume that the rest of the theorem is true for $M_{2k+1}^{(j)}$ and $M_{2k}^{(j)}$. Now

where


zl +1 > = 


41 

A4k + 1 k 2k) 41 + 1 


4 +x) = u %4l+x 41 + 41) 41 + >]• 


The submatrices UH, m+\ in L%~ 1) and R$ +l in R^ l) will retain the same properties as stated in the 
theorem. Let 


x = m41+i 41), y =ar ( &+x 41 + 4h 


and 


l=41+v r=41+v u =4l v=4i 

Then

1 < m,n < 2? 


with 


a JJm\) ^ ^ *** ^ a a(4n/i)» 

a a( w lrt) " a a( w 2rt) “ ■" = °a( u mn) = a n > °n+ 1’ 


Hence by Theorem 2.3 we have 


°a( x mn) = + °n + w ( m ) 

< ^(Wl, l) + ^ + 1) = **(*«+!,«) 


where

        $w(m) = \lceil \log m \rceil$    if the strategy is left-heavy,
        $w(m) = \lfloor \log m \rfloor$  if the strategy is right-heavy,
        $w(m) = m - 1$                   if the strategy is left-to-right,
        $w(m) = 1$                       if the strategy is right-to-left.

Also 


a a( x mn) ^ 1 ) w(rri) — ^aC^m/i+l) 


by assumption. Hence the ordered property for the $\sigma$'s in $L_k^{(j+1)}$ is preserved. A similar argument
can also be used to show the same property is true for the matrix $R_k^{(j+1)}$. Q.E.D.